Setup

Load data
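
The code chunks are hidden in this report; a minimal sketch of the setup and loading step, assuming the course data file is named `movies.Rdata`:

```r
# Packages used throughout the analysis
library(ggplot2)
library(dplyr)
library(statsr)

# Load the movies dataset (file name assumed from the course materials)
load("movies.Rdata")

# Inspect the structure of the data
str(movies)
```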

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

Part 1: Data
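
The output below can be reproduced with base R functions; a sketch:

```r
dim(movies)                  # number of observations and variables
summary(movies)              # numerical summaries and factor counts
min(movies$thtr_rel_year)    # earliest theatrical release year
max(movies$thtr_rel_year)    # latest theatrical release year
```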

## [1] 651  32
##     title                  title_type                 genre    
##  Length:651         Documentary : 55   Drama             :305  
##  Class :character   Feature Film:591   Comedy            : 87  
##  Mode  :character   TV Movie    :  5   Action & Adventure: 65  
##                                        Mystery & Suspense: 59  
##                                        Documentary       : 52  
##                                        Horror            : 23  
##                                        (Other)           : 60  
##     runtime       mpaa_rating                               studio   
##  Min.   : 39.0   G      : 19   Paramount Pictures              : 37  
##  1st Qu.: 92.0   NC-17  :  2   Warner Bros. Pictures           : 30  
##  Median :103.0   PG     :118   Sony Pictures Home Entertainment: 27  
##  Mean   :105.8   PG-13  :133   Universal Pictures              : 23  
##  3rd Qu.:115.8   R      :329   Warner Home Video               : 19  
##  Max.   :267.0   Unrated: 50   (Other)                         :507  
##  NA's   :1                     NA's                            :  8  
##  thtr_rel_year  thtr_rel_month   thtr_rel_day    dvd_rel_year 
##  Min.   :1970   Min.   : 1.00   Min.   : 1.00   Min.   :1991  
##  1st Qu.:1990   1st Qu.: 4.00   1st Qu.: 7.00   1st Qu.:2001  
##  Median :2000   Median : 7.00   Median :15.00   Median :2004  
##  Mean   :1998   Mean   : 6.74   Mean   :14.42   Mean   :2004  
##  3rd Qu.:2007   3rd Qu.:10.00   3rd Qu.:21.00   3rd Qu.:2008  
##  Max.   :2014   Max.   :12.00   Max.   :31.00   Max.   :2015  
##                                                 NA's   :8     
##  dvd_rel_month     dvd_rel_day     imdb_rating    imdb_num_votes  
##  Min.   : 1.000   Min.   : 1.00   Min.   :1.900   Min.   :   180  
##  1st Qu.: 3.000   1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  4546  
##  Median : 6.000   Median :15.00   Median :6.600   Median : 15116  
##  Mean   : 6.333   Mean   :15.01   Mean   :6.493   Mean   : 57533  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 58300  
##  Max.   :12.000   Max.   :31.00   Max.   :9.000   Max.   :893008  
##  NA's   :8        NA's   :8                                       
##          critics_rating critics_score    audience_rating audience_score 
##  Certified Fresh:135    Min.   :  1.00   Spilled:275     Min.   :11.00  
##  Fresh          :209    1st Qu.: 33.00   Upright:376     1st Qu.:46.00  
##  Rotten         :307    Median : 61.00                   Median :65.00  
##                         Mean   : 57.69                   Mean   :62.36  
##                         3rd Qu.: 83.00                   3rd Qu.:80.00  
##                         Max.   :100.00                   Max.   :97.00  
##                                                                         
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :629      no :644      no :558        no :579          no :608     
##  yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box   director            actor1             actor2         
##  no :636    Length:651         Length:651         Length:651        
##  yes: 15    Class :character   Class :character   Class :character  
##             Mode  :character   Mode  :character   Mode  :character  
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##     actor3             actor4             actor5         
##  Length:651         Length:651         Length:651        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    imdb_url            rt_url         
##  Length:651         Length:651        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 
## [1] 1970
## [1] 2014

The “movies” dataset has 651 entries and 32 variables.

Generalizability

According to IMDb, 9,454 movies were released between 1970 and 2014. Our dataset contains 651 of these, which is well under 10% of the population, so the observations can be treated as independent. Since the dataset is described as a random sample, the findings should generalize to movies released in this period.

Causality

Causality can be established only from a randomized experiment, not from an observational study. Since the “movies” dataset is observational, no causal conclusions can be drawn.


Part 2: Research question

How do factors such as genre, runtime, a best picture nomination, and a best actor or actress win affect the audience score, beyond the usual factors such as IMDb rating, critics rating, and year of release?

Why this question interests me:
Some viewers dislike certain genres and may rate a movie poorly even if it is a great film within that genre. In my opinion, most movies nominated for the Best Picture Oscar perform well with audiences. The popularity of the lead actors also affects how audiences rate a movie, and the year and time of release play an important role. Likewise, some people rely on the critics rating and IMDb rating before watching a movie. These are the reasons I am interested in how the factors above affect the audience score.


Part 3: Exploratory data analysis

Although the “movies” dataset contains 32 variables, not all of them help answer the research question. We therefore create a new dataset, “movies_newset”, containing only the variables from “movies” that are useful for the analysis.
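
A sketch of how the new dataset could be built with dplyr; the variable list is taken from the str() output below, and the missing-value handling (651 to 642 rows, with an "exclude" na.action attribute) suggests na.exclude():

```r
# Keep only the variables relevant to the research question
# and drop rows with missing values (651 -> 642 observations)
movies_newset <- movies %>%
  select(title, title_type, genre, runtime, mpaa_rating,
         thtr_rel_year, thtr_rel_month, dvd_rel_year,
         imdb_rating, imdb_num_votes, critics_rating, critics_score,
         audience_rating, audience_score, best_pic_nom, best_pic_win,
         best_actor_win, best_actress_win, best_dir_win, top200_box) %>%
  na.exclude()

dim(movies_newset)
str(movies_newset)
```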

## [1] 642  20
## Classes 'tbl_df', 'tbl' and 'data.frame':    642 obs. of  20 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "na.action")= 'exclude' Named int  100 184 261 334 345 375 377 437 451
##   ..- attr(*, "names")= chr  "100" "184" "261" "334" ...

Let’s look at how some categorical variables relate to the audience score. A density plot shows, for each level of a categorical variable, how the audience scores are distributed, and therefore which levels tend to receive higher scores. We will examine the following categorical variables against audience_score:

  1. title_type
  2. genre
  3. mpaa_rating
  4. critics_rating
  5. best_pic_nom
  6. best_dir_win
  7. best_actor_win
  8. best_actress_win
  9. top200_box
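
The density plots for the variables above follow a common pattern; a sketch for one of them (the same code applies to the others by changing the fill variable):

```r
# Density of audience_score split by best_pic_nom; clearly separated,
# shifted densities suggest the variable is informative
ggplot(movies_newset, aes(x = audience_score, fill = best_pic_nom)) +
  geom_density(alpha = 0.4) +
  labs(x = "Audience score", fill = "Best picture nomination")
```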

These distributions can serve as one input when selecting variables for the model.

From the plots, title_type, genre, mpaa_rating, critics_rating, best_pic_nom, and top200_box give good insight into the audience score, so we can use these categorical variables to build the model.

Let’s look at how some numerical variables relate to the response variable audience_score. Scatterplots show whether each relationship is approximately linear, and the correlation quantifies its strength.
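
A sketch of the scatterplot and correlation computation for one pair; the same pattern is repeated for each numerical variable reported below:

```r
# Scatterplot of audience_score against imdb_rating with a fitted line
ggplot(movies_newset, aes(x = imdb_rating, y = audience_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Correlation between the two variables
movies_newset %>%
  summarise(cor(audience_score, imdb_rating))
```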

## # A tibble: 1 x 1
##   `cor(audience_score, imdb_rating)`
##                                <dbl>
## 1                              0.863
## # A tibble: 1 x 1
##   `cor(audience_score, critics_score)`
##                                  <dbl>
## 1                                0.700
## # A tibble: 1 x 1
##   `cor(audience_score, thtr_rel_year)`
##                                  <dbl>
## 1                              -0.0612
## # A tibble: 1 x 1
##   `cor(audience_score, imdb_num_votes)`
##                                   <dbl>
## 1                                 0.292
## # A tibble: 1 x 1
##   `cor(audience_score, dvd_rel_year)`
##                                 <dbl>
## 1                             -0.0638
## # A tibble: 1 x 1
##   `cor(audience_score, thtr_rel_month)`
##                                   <dbl>
## 1                                0.0399

From the correlations above, imdb_rating (0.863) and critics_score (0.700) have strong positive linear relationships with audience_score, imdb_num_votes (0.292) has a weak one, and thtr_rel_year, dvd_rel_year, and thtr_rel_month show essentially no association.

Test for collinearity

Let’s check whether any collinearity exists among the numerical variables in the movies_newset dataset.
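
One way to produce such a pairwise correlation plot is with GGally::ggpairs; the original code is not shown, so this is an assumed sketch:

```r
library(GGally)

# Pairwise scatterplots and correlations for the numerical variables
movies_newset %>%
  select(runtime, thtr_rel_year, thtr_rel_month, dvd_rel_year,
         imdb_rating, imdb_num_votes, critics_score, audience_score) %>%
  ggpairs()
```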

The plot makes it evident that imdb_rating and critics_score are highly correlated (0.762), and that dvd_rel_year and thtr_rel_year have a correlation of 0.66.

The inference is that including both variables of such a pair adds little value to the model; one variable from each correlated pair is enough.


Part 4: Modeling

Based on the exploratory data analysis, the audience score can be predicted using the following 10 variables:

  1. title_type
  2. genre
  3. imdb_rating
  4. imdb_num_votes
  5. critics_rating
  6. best_pic_nom
  7. thtr_rel_year
  8. best_dir_win
  9. top200_box
  10. runtime

audience_score ~ title_type + runtime + genre + imdb_rating + imdb_num_votes + critics_rating + thtr_rel_year + best_pic_nom + best_dir_win + top200_box
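
Fitting this full model (the name full_model is an assumption; the formula matches the Call in the output below):

```r
# Full multiple linear regression model for audience_score
full_model <- lm(audience_score ~ title_type + runtime + genre +
                   critics_rating + best_pic_nom + imdb_rating +
                   imdb_num_votes + thtr_rel_year + best_dir_win +
                   top200_box,
                 data = movies_newset)
summary(full_model)
```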

## 
## Call:
## lm(formula = audience_score ~ title_type + runtime + genre + 
##     critics_rating + best_pic_nom + imdb_rating + imdb_num_votes + 
##     thtr_rel_year + best_dir_win + top200_box, data = movies_newset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.462  -5.923   0.122   5.622  49.257 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     1.631e+02  8.008e+01   2.036 0.042129 *  
## title_typeFeature Film          7.494e-01  3.621e+00   0.207 0.836109    
## title_typeTV Movie              2.116e+00  5.726e+00   0.370 0.711846    
## runtime                        -5.491e-02  2.348e-02  -2.339 0.019663 *  
## genreAnimation                  8.646e+00  3.517e+00   2.458 0.014228 *  
## genreArt House & International  3.963e-01  3.038e+00   0.130 0.896256    
## genreComedy                     2.162e+00  1.637e+00   1.320 0.187162    
## genreDocumentary                3.228e+00  3.889e+00   0.830 0.406878    
## genreDrama                      5.330e-01  1.429e+00   0.373 0.709273    
## genreHorror                    -5.245e+00  2.404e+00  -2.182 0.029510 *  
## genreMusical & Performing Arts  6.272e+00  3.365e+00   1.864 0.062804 .  
## genreMystery & Suspense        -5.366e+00  1.806e+00  -2.971 0.003086 ** 
## genreOther                      1.040e+00  2.793e+00   0.372 0.709847    
## genreScience Fiction & Fantasy -1.987e+00  3.685e+00  -0.539 0.589823    
## critics_ratingFresh            -2.364e+00  1.222e+00  -1.935 0.053425 .  
## critics_ratingRotten           -4.955e+00  1.321e+00  -3.750 0.000193 ***
## best_pic_nomyes                 2.818e+00  2.331e+00   1.209 0.227032    
## imdb_rating                     1.487e+01  5.372e-01  27.682  < 2e-16 ***
## imdb_num_votes                  4.435e-06  4.585e-06   0.967 0.333736    
## thtr_rel_year                  -9.495e-02  3.952e-02  -2.403 0.016562 *  
## best_dir_winyes                -1.594e+00  1.626e+00  -0.980 0.327316    
## top200_boxyes                   5.159e-01  2.724e+00   0.189 0.849841    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.762 on 620 degrees of freedom
## Multiple R-squared:  0.7729, Adjusted R-squared:  0.7652 
## F-statistic: 100.5 on 21 and 620 DF,  p-value: < 2.2e-16

The adjusted R-squared value for the current model is 0.7652 and the p-value is < 2.2e-16.

To fit a parsimonious model for predicting the response variable audience_score, we will use backward elimination with adjusted R-squared as the criterion. We choose adjusted R-squared because the goal of this model is reliable prediction.

Carrying out backward elimination by hand would be time-consuming, so we will use the step function instead. Note that step() performs its elimination using AIC rather than adjusted R-squared, but both criteria favor parsimonious, predictive models, and step() saves a lot of time during model selection.
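
A sketch of the automated elimination; full_model is the name assumed for the model fitted above, and best_model is the name used later in this report:

```r
# step() drops one predictor at a time as long as the AIC improves;
# trace = FALSE suppresses the intermediate output
best_model <- step(full_model, direction = "backward", trace = FALSE)
summary(best_model)
```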

## 
## Call:
## lm(formula = audience_score ~ runtime + genre + critics_rating + 
##     imdb_rating + thtr_rel_year, data = movies_newset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.281  -6.029   0.190   5.483  49.551 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    144.78196   76.24389   1.899  0.05803 .  
## runtime                         -0.04714    0.02197  -2.145  0.03233 *  
## genreAnimation                   8.58994    3.50461   2.451  0.01452 *  
## genreArt House & International  -0.03871    2.99099  -0.013  0.98968    
## genreComedy                      2.10082    1.62211   1.295  0.19576    
## genreDocumentary                 1.92799    2.04548   0.943  0.34627    
## genreDrama                       0.36446    1.39008   0.262  0.79326    
## genreHorror                     -5.31573    2.39106  -2.223  0.02656 *  
## genreMusical & Performing Arts   5.33036    3.12207   1.707  0.08826 .  
## genreMystery & Suspense         -5.47038    1.79163  -3.053  0.00236 ** 
## genreOther                       1.52787    2.76016   0.554  0.58009    
## genreScience Fiction & Fantasy  -1.97938    3.67608  -0.538  0.59046    
## critics_ratingFresh             -2.97365    1.15186  -2.582  0.01006 *  
## critics_ratingRotten            -5.41478    1.27014  -4.263 2.33e-05 ***
## imdb_rating                     15.04544    0.51198  29.387  < 2e-16 ***
## thtr_rel_year                   -0.08598    0.03782  -2.273  0.02335 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.745 on 626 degrees of freedom
## Multiple R-squared:  0.7715, Adjusted R-squared:  0.766 
## F-statistic: 140.9 on 15 and 626 DF,  p-value: < 2.2e-16

The simplified model after backward elimination has 5 variables: runtime, genre, imdb_rating, critics_rating, and thtr_rel_year. Runtime, imdb_rating, thtr_rel_year, and critics_rating are individually significant in this model. The adjusted R-squared value is 0.766 and the p-value is < 2.2e-16.

This fitted model, best_model, is our parsimonious model:

audience_score ~ runtime + genre + critics_rating + imdb_rating + thtr_rel_year

Model diagnostics for the Multiple Linear Regression Model

The following conditions need to be satisfied to confirm the model is a reliable model.

1. Linear relationship between each numerical predictor and the response variable.

This can be checked by plotting each numerical predictor in the model against the residuals.
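
A sketch for one predictor; the same plot is repeated for each numerical variable in the model:

```r
# Residuals against imdb_rating; random scatter around 0 supports linearity
ggplot(data = NULL, aes(x = movies_newset$imdb_rating,
                        y = residuals(best_model))) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "imdb_rating", y = "Residuals")
```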

From the three plots above, we notice that for all three variables the residuals are scattered randomly around 0, indicating a linear relationship between each numerical predictor and the response.

2. Nearly normal residuals with mean 0

This can be checked with a histogram and a normal probability plot. For a reliable linear regression, the residuals should be nearly normal and centered at 0.
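
A sketch of the two plots using base graphics:

```r
# Histogram and normal probability plot of the residuals
hist(residuals(best_model), main = "Residuals", xlab = "Residual")
qqnorm(residuals(best_model))
qqline(residuals(best_model))
```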

From the plots, we observe that the residuals are indeed centered at 0. The histogram is slightly right-skewed, but most points on the normal probability plot fall along the line, so the nearly-normal-residuals condition is reasonably satisfied.

3. Constant variability of residuals

The constant variability of residuals can be checked by plotting the predicted values against the residuals. For a reliable linear regression, the points should scatter randomly around 0 with no fan shape.
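
A sketch of the fitted-values-versus-residuals plot:

```r
# Predicted values against residuals; look for random scatter, no fan shape
plot(fitted(best_model), residuals(best_model),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```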

The plot above shows random scatter, which supports constant variability of the residuals.

4. Independent Residuals

Plotting the residuals in the order of the observations checks whether they are independent and whether any time-series structure is present.
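
A sketch of the order plot:

```r
# Residuals in observation order; no trend means no time-series structure
plot(residuals(best_model), ylab = "Residuals")
abline(h = 0, lty = 2)
```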

The plot shows no time-series pattern: the residuals scatter randomly around 0, indicating that they are independent.


Part 5: Prediction

To check how our final model best_model performs, let us apply it to some examples. First, we predict the audience_score of one of the biggest blockbusters of 2018, “Avengers: Infinity War”.
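
A sketch of the prediction; the predictor values for the movie are approximations taken from public sources (IMDb and Rotten Tomatoes) and are assumptions, since the original code is not shown:

```r
# Hypothetical new-data frame with the best_model predictors;
# values for Avengers: Infinity War are approximate
infinity_war <- data.frame(runtime = 149,
                           genre = "Action & Adventure",
                           critics_rating = "Certified Fresh",
                           imdb_rating = 8.4,
                           thtr_rel_year = 2018)

predict(best_model, infinity_war)
predict(best_model, infinity_war, interval = "prediction", level = 0.95)
```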

##        1 
## 90.15883

Our model predicts an audience_score of 90.16%, while the actual audience_score on Rotten Tomatoes is 91%, so the estimate is very close. To check whether the actual audience_score falls within the interval, let us look at the 95% prediction interval.

##        fit     lwr     upr
## 1 90.15883 69.6007 110.717

The actual value, 91%, falls well within the 95% prediction interval of (69.60, 110.72); since audience scores cannot exceed 100, the upper bound is effectively capped at 100.

Let’s look at a movie from a different genre, with a lower imdb_rating and critics_rating, to understand the model’s predictions. We’ll predict the audience_score of the Bond movie “Live and Let Die”.
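
As before, a sketch with approximate predictor values taken from public sources (assumptions, since the original code is hidden):

```r
# Hypothetical new-data frame; values for Live and Let Die are approximate
live_and_let_die <- data.frame(runtime = 121,
                               genre = "Action & Adventure",
                               critics_rating = "Fresh",
                               imdb_rating = 6.8,
                               thtr_rel_year = 1973)

predict(best_model, live_and_let_die)
predict(best_model, live_and_let_die, interval = "prediction", level = 0.95)
```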

##        1 
## 63.30584

According to our model, the predicted audience_score is 63.3%, while the actual audience_score is 65%; the prediction is very close to the actual score.

##        fit      lwr      upr
## 1 63.30584 43.90693 82.70475

The actual audience_score falls within the 95% prediction interval of (43.91, 82.70).

These two examples show that best_model predicts well, with low error rates (0.92% for Avengers: Infinity War and 2.61% for Live and Let Die), and in both cases the actual audience_score falls within the 95% prediction interval. The sources for the actual scores are listed in the “Part 7: References” section.


Part 6: Conclusion

Based on our analysis, we conclude that runtime and genre are the two variables that best help predict the audience score beyond the usual variables (IMDb rating, critics rating, and year of release). Contrary to my expectations, a best actor or actress win did not influence the audience score as much as genre or runtime, as the density plots of audience score against best actor/actress win show.

The parsimonious model does a decent job of predicting the audience score, coming close to the actual values. The error rate could be reduced further by including additional variables such as box office collections or the popularity of the franchise (e.g., the Marvel Cinematic Universe).


Part 7: References

  1. Avengers: Infinity War Wikipedia page
  2. Avengers: Infinity War IMDb page
  3. Avengers: Infinity War Rotten Tomatoes page
  4. Step function documentation